MultiMASC: An Open Linguistic Infrastructure for Language Research

Author

  • Nancy Ide
Abstract

This paper describes MultiMASC, which builds upon the Manually Annotated Sub-Corpus (MASC) project (Ide et al., 2008; Ide et al., 2010), a community-based collaborative effort to create, annotate, and validate linguistic data and annotations for a broad-genre corpus of open language data. MultiMASC will extend MASC to include comparable corpora in other languages that not only represent the same genres and styles, but also include similar types and numbers of annotations represented in a common format. Like MASC, MultiMASC will contain only completely open data and will rely on a collaborative, community-based effort for its development. We describe the possible ways in which additional corpora for MultiMASC can be collected and annotated, and consider the dimensions along which “comparability” for MultiMASC corpora can be determined. Because it is unlikely that all language-specific MultiMASC corpora can be comparable along every dimension, we also outline the measures that can be used to gauge comparability for a number of different

Similar resources

Ontologies for a Global Language Infrastructure

With the recent developments of the Semantic Web and progress in the associated methodologies and standards, demands for an open and distributed infrastructure for sharing language resources and technologies can now be addressed on a new basis (Buitelaar et al., 2003; Calzolari, 2008). In this article, we call such an infrastructure a global language infrastructure (GLI). GLI should accommoda...

Building an open-source development infrastructure for language technology projects

The article presents the Giellatekno & Divvun language technology resources, more specifically the effort to utilise open-source tools to improve the build infrastructure, and the solutions to help adapt to best practices for software development. The article especially discusses how the infrastructure has been remade to cope with an increasing number of languages without incurring extra overhe...

Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation

The Linguistic Data Consortium (LDC) at the University of Pennsylvania has participated as a data provider in a variety of government-sponsored programs that support development of Human Language Technologies. As the number of projects increases, the quantity and variety of the data LDC produces have increased dramatically in recent years. In this paper, we describe the technical infrastructure, bot...

Baltic and Nordic Parts of the European Linguistic Infrastructure

This paper describes scientific, technical, and legal work done on the creation of the linguistic infrastructure for the Nordic and Baltic countries. The paper describes the research on assessment of language technology support for the languages of the Baltic and Nordic countries, work on establishing a language resource sharing infrastructure, and collection and description of linguistic resou...

A New Paradigm of an Open Distributed Language Resource Infrastructure: The Case of Computational Lexicons

The need for ever-growing language resources calls for us to propose and promote a change in the overall model of how to build, maintain and share Language Resources (LRs). A new paradigm is required, in the form of an open, distributed and collaborative language infrastructure, based on open content interoperability standards. Existing experience in LR development proves that such a challeng...

Publication date: 2012